ABSTRACT

With a vast amount of data available on online social networks,
how to enable efficient analytics has been an increasingly important
research problem. Many existing studies resort to sampling
techniques that draw random nodes from an online social network
through its restrictive web/API interface. Almost all of these
techniques share the same underlying technique, the random walk:
a Markov Chain Monte Carlo based method that iteratively transits
from one node to a random neighbor.

Random walk fits naturally with this problem because, for most
online social networks, the only query we can issue through the
interface is to retrieve the neighbors of a given node (i.e., no access
to the full graph topology). A problem with random walks,
however, is the “burn-in” period, which requires a large number of
transitions/queries before the sampling distribution converges to a
stationary value that enables the drawing of samples in a statistically
valid manner.

In this paper, we consider a novel problem of speeding up the
fundamental design of random walks (i.e., reducing the number of
queries it requires) without changing the stationary distribution it
achieves - thereby enabling a more efficient “drop-in” replacement
for existing sampling-based analytics techniques over online social
networks. Technically, our main idea is to leverage the history of
random walks to construct a higher-order Markov chain. We develop
two algorithms, Circulated Neighbors and Groupby Neighbors
Random Walk (CNRW and GNRW), and rigorously prove that,
no matter what the social network topology is, CNRW and GNRW
offer better efficiency than baseline random walks while achieving
the same stationary distribution. We demonstrate through extensive
experiments on real-world social networks and synthetic graphs the
superiority of our techniques over the existing ones.


1. INTRODUCTION 
1.1 Motivation 

With the broad penetration of online social networks and the
multitude of information they capture, how to enable a third party¹
to perform efficient analytics of data available on social networks
- specifically, to answer global and conditional aggregates such as
SUM, AVG, and COUNT (e.g., the average friend count of all users
living in Texas) - has become an increasingly important research
problem in the database community [15, 20, 19]. Applications
of such aggregate estimations range from sociology research and
understanding economic indicators to observing public health trends,
bringing benefits to the entire society.

Technically, the challenge of data analytics over online social
networks mainly stems from the limitations of their (web) query
interfaces available to a third party - most online social networks
only allow local neighborhood queries, with input being a node
(i.e., user) and output being its immediate neighbors. Such a query
interface makes retrieving the entire graph topology prohibitively
expensive for a third party (in terms of query cost - due to the large
size of real-world social networks). To address the problem, existing
studies resort to a sampling approach, specifically the sampling
of nodes (i.e., users) through the restrictive interface of a social
network [22, 23], to enable aggregate estimations based on the sampled
nodes. This is also the approach we focus on in the paper.

1.2 Existing Techniques and Their Problems 

While there have been a wide variety of social network sampling
designs [12, 10, 11, 7] proposed in the literature, the vast majority
of them share the same core design: random walk, a Markov Chain
Monte Carlo (MCMC) method which iteratively transits from one
node to a random neighbor. The variations differ in their design
of the transition probabilities, i.e., the probability distributions for
choosing the random neighbor, which are naturally adjusted for the
various types of analytical tasks they target. Nonetheless, the core
random walk design remains the same - after performing the random
walk for a number of steps, the current node is taken as a
sample - and, after repeated executions of random walks, we can
generate (statistically accurate) aggregate estimations from the multiple
sample nodes.

To understand the problem with this core design, it is important
to observe the key bottleneck of its practical implementations. Note
that such a sampling design incurs two types of overhead. One is
query cost: Each transition in the random walk requires one local
neighborhood query (described above) to be issued to the online
social network, while almost all real-world social networks enforce
rigid query-rate limits (e.g., Twitter allows only 15 local neighborhood
queries every 15 minutes). The second type is the local processing
(time and space) overhead for recording sampled nodes and
computing aggregate estimations. One can see that the bottleneck
here is clearly the query cost - compared with the extremely low
query rate allowed by online social networks (e.g., 1 minute/query
for Twitter), the local processing overhead (linear in the sample size
[13]) is negligible.

¹ i.e., one who is not the social network owner - examples include
sociologists, economists, etc.

With the understanding that a sampling algorithm must minimize
its query cost, the problem with existing techniques can be summarized
in one sentence: The core random walk design requires a long
“burn-in” before a node can be taken as a sample - since each step
requires one query, this leads to a high query cost and therefore a
very slow sampling process over real-world online social networks.

To understand what the burn-in period is and why it is required,
we note that a sample node can only be used for analytics if we
know the bias of the sampling process, i.e., the probability for a
node to be taken as a sample. The reason for this requirement is
simple - only with knowledge of the sampling bias can we properly
correct it to ensure equal representation of all applicable tuples.
Nonetheless, actually learning the sampling distribution without
knowledge of the global graph topology is difficult. A desirable
property of random walk is that it asymptotically converges (as it
grows longer) to a stationary sampling distribution that can be derived
without knowledge of the graph topology (e.g., with probability
proportional to a node’s degree for simple random walk [18],
or uniform distribution for Metropolis-Hastings random walk [18]).
The number of steps required for a random walk to reach this stationary
distribution is the “burn-in” period which, unfortunately, is
often quite long for real-world social networks [23, 7].

1.3 Our Idea: History-Aware Random Walks 

The focus of this paper is to offer a “drop-in” replacement for
this core design (of random walk), such that existing sampling-based
analytics techniques over online social networks, no matter
which analytics tasks they support or graph topologies they target,
can have a better efficiency without changing other parts of their design.
To do so, we shorten the burn-in period (i.e., reduce the query
cost) of the fundamental random-walk design without changing the
stationary distribution it achieves - ensuring the transparency of this
change to how the core design is called upon in data analytics.

Technically, our key idea here is motivated by an observation on 
the potential waste of queries caused by the existing random walk 
design: Most existing random walk based techniques are Markov 
Chain Monte Carlo (MCMC) methods that are memoryless - i.e., 
they do not take into account the historic nodes encountered in 
a random walk in the design of future transitions. These methods,
while simple, waste substantial chances for leveraging historic
queries to speed up random walks - a waste we aim to eliminate 
with our proposed ideas. 

Our main idea developed in the paper is to introduce historic
dependency to the design of random walks over graphs. Specifically,
we consider the nodes already visited by the current random walk
while deciding where to go for the next step. We start with
developing a simple algorithm called Circulated Neighbors Random
Walk (CNRW). The difference between the traditional random walk
and CNRW is that between sampling with and without replacement
when deciding on the next move. Specifically, consider a random walk
with the last move being u → v. To determine the next move,
the traditional random walk design is to sample uniformly at random
from the neighbors of v, i.e., with replacement. With CNRW, this
sampling is done without replacement - in other words, if the random
walk has transited through u → v → w before, then we exclude w from
being considered for the next move, until the random walk has passed
through u → v → x for all neighbors x of v. We prove that, while
CNRW shares the exact same stationary distribution (see Definition 1)
as the traditional simple random walk, it is provably more
(or equally) efficient than the traditional random walk no matter
what the underlying topology is.

A rationale behind the design of CNRW can be explained with
the following simple example. Note that if a random walk over
a large graph comes back to u → v after just passing through
u → v → w, it is likely an indication that v → w leads to a small
component that is not well connected to the rest of the graph (as
otherwise the probability of going back to u should be very small).
Falling into such a small component “trap” is undesirable for a random
walk, which needs to “spread” to the entire graph as quickly as
possible. Thus, an intuitive idea to improve the efficiency of a random
walk is to avoid following v → w again, so as to increase the
chance of following the other (hopefully better-for-random-walk)
edges associated with v. This intuition leads to the “circulated”,
without-replacement, design of CNRW.²

Based on the idea of CNRW, we develop GroupBy Neighbors
Random Walk (GNRW), which further improves the performance
of a random walk by considering not only the nodes visited by a
random walk, but also the observed attribute values of these visited
nodes. To understand the key idea of GNRW, consider the
following intuitive observation on the structure of an online social
network: users with similar attribute values (e.g., age, interests,
occupation) are more likely to be connected with each other. Leveraging
this observation, to decide which neighbor of v to transit to
from u → v, GNRW partitions all neighbors of v into a number of
strata according to their values on an attribute of interest (often the
measure attribute to be aggregated - see discussions in Section 4).
Then, GNRW first selects a stratum uniformly at random without
replacement - i.e., it “circulates” among the strata until selecting
each stratum once. Then, GNRW chooses a neighbor of v from the
selected stratum, again uniformly at random without replacement.
This neighbor then becomes the next step.

Similar to the theoretical analysis of CNRW, we also prove that 
GNRW shares the same stationary distribution as the traditional 
random walk. Intuitively, the design of GNRW aims to “speed up” 
the random walk by alternating between different attribute values 
faster, instead of getting “stuck” on a small component of the graph 
sharing the same (or similar) attribute value. We shall demonstrate 
through experimental evaluation that GNRW can significantly improve
the efficiency of random walks, especially when the ultimate
objective of these random walks is to support aggregate estimations 
on attributes used for stratification in GNRW. 

1.4 Summary of Contributions 

We summarize the main contributions as follows. 

• We consider a novel problem of leveraging historic transitions
to speed up random walks over online social networks.

• We develop Algorithm CNRW which features a key change
from traditional random walks: instead of selecting the next
transition by sampling with replacement from all neighbors
of the current node, CNRW performs the sampling without
replacement, thus becoming a history-dependent random
walk.

• We develop Algorithm GNRW which further improves the 
efficiency of random walk by leveraging not only the history 
of nodes visited by a random walk, but also the observed 
attribute values of these visited nodes. 


• We present theoretical analysis which shows that, while CNRW
and GNRW both produce samples of the exact same distribution
as the traditional random walk, they are provably more
efficient no matter what the underlying graph topology is.

• Our contribution also includes extensive experiments over
real-world online social networks which demonstrate the superiority
of CNRW and GNRW over traditional random walk
techniques.

² One might wonder why the circulation is conditioned upon traveling
through an edge (i.e., u → v) again rather than traveling
through a vertex (say v) again - this is indeed a subtle point which
we further address in Section 3.2.

1.5 Paper Organization 

This paper is organized as follows. Section 2 describes preliminaries
of random walks. Section 3 and Section 4 introduce Circulated
Neighbors Random Walk (CNRW) and Groupby Neighbors
Random Walk (GNRW). Section 5 presents two discussions of our
algorithms. Section 6 shows the results of our experiments. Section 7
overviews the related work. Section 8 provides our brief
conclusion.

2. PRELIMINARIES 

In this section, we introduce preliminaries that are important for
the technical development of this paper. We start with
introducing the access model for online social networks, followed
by a discussion of the most popular method of sampling an online
social network - random walks. Specifically, we shall introduce two
important concepts related to random walks, order and stationary
distribution of a Markov chain, which are critical for the development
of our techniques later in the paper. Finally, at the end of
this section, we define the key performance measures for a random
walk based sampling algorithm.

2.1 Access Model for Online Social Networks 

As a third party with no direct access to the backend data repository
of an online social network, the only access channel we have
to the data is the web and/or API interface provided by the online
social network. While the design of such interfaces varies across
different real-world online social networks, almost all of them support
queries that take any user ID u as input and return two types
of information about u:

• N(u), the set of all neighbors of u, and

• all other attributes of u (e.g., user self-description, profile,
posts).

Note that the definition of a “neighbor” may differ for different
types of social networks - e.g., Twitter distinguishes between followers
and followees and thus features directed connections and
asymmetric neighborhoods (i.e., a neighbor of u may not have u in
its neighbor list), while Google Plus features undirected “friendship”
edges and thus symmetric neighborhoods. For the purpose of
this paper, we consider undirected edges - i.e., ∀v ∈ N(u), there
is u ∈ N(v). Note that online social networks that feature directed
edges can be “cast” into this undirected definition. For example,
we can define an undirected edge if either edge u → v or
v → u exists. Given the definition of node and neighbors, we consider
the social network topology as an undirected graph G(V, E).

V is the set of all the users (i.e., vertices/nodes) and E is the set
of all the connections between two users: $E = \{e_{uv} \mid u, v \in V\}$,
where $e_{uv}$ is the connection between user u and user v. Also, we
use $k_v$ to denote the degree of node v, where $k_v = |N(v)|$.
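The casting of directed connections into undirected neighborhoods can be sketched as follows. This is only a minimal illustration: the function name `to_undirected`, the edge-list input format, and the choice to ignore self-loops are assumptions made here, not part of the access model itself.

```python
from collections import defaultdict


def to_undirected(directed_edges):
    """Build {node: set(neighbors)} with an undirected edge wherever
    u -> v or v -> u exists in the (hypothetical) directed edge list."""
    neighbors = defaultdict(set)
    for u, v in directed_edges:
        if u == v:
            continue  # assumption: self-loops are ignored
        neighbors[u].add(v)
        neighbors[v].add(u)
    return dict(neighbors)


# Toy "follows" relation: a and b follow each other, a follows c, d follows c.
follows = [("a", "b"), ("b", "a"), ("a", "c"), ("d", "c")]
graph = to_undirected(follows)

# Degree k_v = |N(v)| of every node in the undirected view.
degree = {v: len(nbrs) for v, nbrs in graph.items()}
```

After the conversion, every neighborhood is symmetric, which is the property the rest of the paper relies on.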

Many real-world online social networks also impose a query rate
limitation. For example, Twitter recently updated its API rate limits
to “15 calls every 15 minutes”³, and Yelp offers only “25,000 API
calls per day”⁴.

³ https://dev.twitter.com/rest/public/rate-limiting

2.2 Random walk 

Almost all existing work on sampling an online social network
through its restrictive web/API interface (as modeled above) consists of
variations of the random walk technique. We briefly discuss the
general concept of a random walk and a specific instance, the simple
random walk, respectively as follows.

2.2.1 Random Walk as an MCMC Process 

Order: Intuitively, a random walk on an online social network
randomly transits from one node to another according to a pre-determined
randomized transition algorithm (and the neighborhood/node
information it has retrieved through the restrictive interface of the
online social network). From a mathematical standpoint, a random
walk can be considered a Markov Chain Monte Carlo (MCMC)
process with its state $X_i$ being the node visited by the random walk
at Step i and transition probability distribution

$$\Pr(X_n = x_n \mid X_{n-1} = x_{n-1}, \ldots, X_1 = x_1) = \Pr(X_n = x_n \mid X_{n-1} = x_{n-1}, \ldots, X_{n-m} = x_{n-m})$$

where $x_i \in V$. Here m (1 ≤ m < n) is the order of the Markov
chain. Most existing random walk techniques over online social
networks are first-order (i.e., with m = 1). That is, with these
random walks, the next transition only depends upon the current
node being visited, and is independent of the previous history of the
random walk. For higher-order random walks, the next transition
is determined by not only the current node, but also an additional
m − 1 steps into the history as well (e.g., the non-backtracking
simple random walk (NB-SRW) has an order of m = 2, as
it avoids transitioning back to the immediate last node whenever
possible). Note that the random walk technique we will present in
the paper has a much higher order than these existing techniques -
more details in Section 3 and Section 4.
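To make the notion of order concrete, the sketch below contrasts an order-1 step (which needs only the current node) with an order-2, non-backtracking step (which also needs the previous node). The function names, the dict-of-sets graph representation, and the fallback at dead ends are illustrative assumptions; they are not taken from the NB-SRW literature verbatim.

```python
import random


def srw_step(graph, current, rng=random):
    """Order 1: the next node depends on the current node only."""
    return rng.choice(sorted(graph[current]))


def nbsrw_step(graph, previous, current, rng=random):
    """Order 2: also uses the previous node, avoiding an immediate
    return to it whenever another neighbor is available."""
    candidates = [w for w in sorted(graph[current]) if w != previous]
    if not candidates:
        # Dead end (degree 1): backtracking is the only possible move.
        candidates = sorted(graph[current])
    return rng.choice(candidates)
```

The only difference in the signatures is the extra `previous` argument, which is exactly the additional step of history that raises the order from 1 to 2.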

Stationary Distribution: The premise of using random walk to
generate statistically valid samples is that, after performing a random
walk for a sufficient number of steps, the probability for the
random walk to land on each node converges to a stationary distribution
$\pi$. In other words, if the random walk has a stationary distribution,
then after a sufficient number of steps T, the probability
of being at node u after T steps is the same as after T + 1 steps.
The convergence has been proved to hold as long as the transition is
irreducible and positive recurrent [16]. In particular, a Simple Random
Walk (defined in Section 2.2.2) on a connected non-bipartite graph
has a stationary distribution.

Definition 1. [Stationary distribution]. Let $\pi_j$ denote the
long-run proportion of time that a random walk spends on node j:

$$\pi_j = \lim_{n \to \infty} \frac{1}{n} \sum_{m=1}^{n} \Pr\{X_m = j\} \tag{1}$$

One can see that the stationary distribution of a random walk
determines how we can deal with the samples, and it enables us to
further analyze the samples, e.g., to estimate certain aggregates.

2.2.2 Simple Random Walk

In the current literature of sampling online social networks through
their restrictive interfaces, Simple Random Walk (SRW) is one of
the most popular techniques being used. Simple random walk is an
order-1 Markov chain with transition distribution being uniform on
all nodes in the neighborhood of the current node $X_i$. Formally, we
have the following definition.

⁴ http://www.yelp.com/developers/faq


Definition 2. [Simple Random Walk]. Given graph G(V, E)
and a node v ∈ V, a random walk is called Simple Random Walk if
it chooses uniformly at random a neighboring node u ∈ N(v) and
transits to u in the next step:

$$P_{vu} = \begin{cases} 1/k_v & \text{if } u \in N(v), \\ 0 & \text{otherwise.} \end{cases} \tag{2}$$

Definition 3. [Asymptotic Variance]. Given a function $f(\cdot)$
and an estimator $\hat{\mu}_n = \frac{1}{n} \sum_{t=1}^{n} f(X_t)$, the asymptotic variance of
the estimator is defined as in Equation (4) of Section 2.3.


Corresponding to this transition design, one can easily compute
the stationary distribution for simple random walk as

$$\pi_v = \frac{k_v}{2|E|} \tag{3}$$

That is, SRW selects each node in the graph with probability
proportional to its degree. In later parts of the paper, we shall show
how our proposed technique achieves the same sampling distribution
while significantly improving the efficiency of random walks.
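Because the sampling probability is known up to a constant (it is proportional to the degree), the bias can be corrected when estimating an aggregate. The sketch below uses a standard self-normalized reweighting estimator for AVG, weighting each sample by the inverse of its degree; the function name and the `(value, degree)` input format are assumptions for illustration, and this is not necessarily the exact estimator used in the experiments of this paper.

```python
def reweighted_average(samples):
    """Estimate the population average of `value` from samples drawn
    with probability proportional to degree.

    samples: iterable of (value, degree) pairs, one per sampled node.
    """
    numerator = 0.0
    denominator = 0.0
    for value, degree in samples:
        numerator += value / degree    # down-weight high-degree nodes
        denominator += 1.0 / degree
    return numerator / denominator


# Three nodes with values 10, 20, 30 and degrees 1, 2, 3. A sample in which
# each node appears exactly in proportion to its degree:
samples = [(10, 1)] + [(20, 2)] * 2 + [(30, 3)] * 3
estimate = reweighted_average(samples)
naive = sum(value for value, _ in samples) / len(samples)
```

In this toy sample the reweighted estimate recovers the true average of the three nodes, while the naive sample mean is pulled toward the high-degree node.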


2.3 Performance Measures for Sampling 

There are two important objectives for sampling from an online 
social network: minimizing the sampling bias and minimizing the 
query cost. We define the two performance measures corresponding
to the two objectives respectively as follows.

Sampling Bias: Intuitively, sampling bias measures the “distance” 
between the stationary distribution of a random walk and the actual
sampling distribution achieved by the actual execution of the
random walk algorithm. To understand why this is an important 
measure, we note that there is an inherent tradeoff between such 
a distance and the query cost required for sampling: If one does 
not care about the sampling bias and opts to stop a random walk 
right where it starts, the sampling bias will be the distance between 
the stationary distribution and a probability distribution vector of 
the form {1, 0,..., 0} (i.e., while the starting node has probability 
100% to be sampled, the other nodes have 0%) - i.e., an extremely 
large sampling bias. 

We propose to measure sampling bias with three metrics for various
purposes in this paper: KL-divergence, $\ell_2$-distance and
a “golden measure” of the error of aggregate estimations produced
by the sample nodes. Specifically, while the former two measures
can be used in theoretical analysis and experimental analysis on
small graphs, they are infeasible to compute for large graphs used
in our experimental studies - motivating us to use the last measure
as well. We shall further discuss the bias and the accuracy of various
random walk algorithms in the experiments section (Section 6).
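For small graphs where both distributions can be written down, the first two metrics are straightforward to compute. The sketch below is a minimal illustration: the function names, the dict-based representation, and the direction of the KL-divergence (actual sampling distribution relative to the stationary one) are assumptions made here, since the text above does not fix them.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) for dicts mapping node -> probability.

    Terms with p[x] == 0 contribute 0; q[x] must be positive
    wherever p[x] is positive.
    """
    return sum(px * math.log(px / q[x]) for x, px in p.items() if px > 0)


def l2_distance(p, q):
    """Euclidean distance between two probability vectors."""
    nodes = set(p) | set(q)
    return math.sqrt(sum((p.get(x, 0.0) - q.get(x, 0.0)) ** 2 for x in nodes))


# Path graph a - b - c: degree-proportional stationary distribution,
# compared against a walk that stops right where it starts (node a).
stationary = {"a": 0.25, "b": 0.5, "c": 0.25}
start = {"a": 1.0, "b": 0.0, "c": 0.0}
bias_kl = kl_divergence(start, stationary)
bias_l2 = l2_distance(start, stationary)
```

Both values are large for the degenerate "stop at the start" walk and shrink to zero as the sampling distribution approaches the stationary one.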

Query Cost. Another key performance measure for sampling over
an online social network is the query cost - i.e., the number of
queries (as defined in the access model described in Section 2.1)
one has to issue in order to obtain a sample node. Note that query
cost here is defined as the number of unique queries required, as
any duplicate query can be immediately retrieved from local cache
without consuming the aforementioned query rate limit enforced
by the online social network.
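The "unique queries" accounting can be sketched with a small caching wrapper. The class name `CachedInterface` and the in-memory dict standing in for the remote social network are hypothetical; a real crawler would issue a web/API request where the dict lookup happens.

```python
class CachedInterface:
    """Wraps a neighbor lookup and counts only unique queries."""

    def __init__(self, graph):
        self._graph = graph   # stand-in for the remote social network
        self._cache = {}      # node -> neighbors already retrieved
        self.query_cost = 0   # number of unique queries issued so far

    def neighbors(self, node):
        if node not in self._cache:
            # Cache miss: this is the only case that consumes rate limit.
            self.query_cost += 1
            self._cache[node] = frozenset(self._graph[node])
        return self._cache[node]


api = CachedInterface({"a": {"b"}, "b": {"a", "c"}, "c": {"b"}})
api.neighbors("a")
api.neighbors("a")  # duplicate query, served from the local cache
api.neighbors("b")
```

After these three calls only two unique queries have been charged, which is the quantity the query cost measure refers to.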

Asymptotic Variance. Variance is one of the most important
measures of the performance of random walk algorithms. Of
course, we can measure the sampling bias given a certain query cost,
but after a sufficient number of steps, the efficiency of an estimator
provided by random walk samples is directly tied to its variance,
which also allows us to theoretically compare the performance
of random walks. For the purpose of this paper, we adopt
the commonly used asymptotic variance of a Markov chain estimator $\hat{\mu}_n$:

$$V_\infty(\hat{\mu}) = \lim_{n \to \infty} n \, \mathrm{Var}(\hat{\mu}_n) \tag{4}$$

Note that this does not depend on the initial distribution for $X_0$.
In addition, in practice, we would use an estimator based on only
$X_t$ with t greater than some very large number - say h - such that we
believe the chain has reached a distribution close to its stationary
distribution $\pi$ after h steps.
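In practice the limit in Equation (4) cannot be evaluated directly, so it has to be estimated from a finite chain. One standard option, swapped in here purely for illustration and not described in this paper, is the batch means method: discard the first h values, cut the remainder into equal batches, and scale the sample variance of the batch means by the batch size. The function name and parameters are assumptions.

```python
from statistics import mean, variance


def batch_means_variance(values, burn_in, num_batches):
    """Batch-means estimate of the asymptotic variance of the sample mean.

    values:      f(X_t) evaluated along one random walk
    burn_in:     number of initial values to discard (the h above)
    num_batches: number of equal-sized, consecutive batches
    """
    kept = values[burn_in:]
    size = len(kept) // num_batches
    if num_batches < 2 or size < 1:
        raise ValueError("need at least two non-empty batches")
    batch_averages = [
        mean(kept[i * size:(i + 1) * size]) for i in range(num_batches)
    ]
    # n * Var(mean of n values) is approximated by size * Var(batch mean).
    return size * variance(batch_averages)
```

A smaller value indicates that fewer samples are needed for the same estimation accuracy, which is the sense in which random walk designs are compared later on.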

3. CNRW: CIRCULATED NEIGHBORS 
RANDOM WALK 

In this section, we develop Circulated Neighbors Random Walk 
(CNRW), our first main idea for introducing historic dependency 
to the design of random walks over graphs. Specifically, we start 
with describing the key idea of CNRW, followed by a theoretical 
analysis of (1) its equivalence with simple random walk (SRW) in 
terms of (stationary) sampling distribution, and (2) its superiority 
over SRW on sampling efficiency. Finally, we present the pseudo 
code for Algorithm CNRW. 

3.1 Main Idea and Justification 

Key Idea of CNRW: In traditional random walks, the transition at
each node is memoryless - i.e., when the random walk arrives at a
node, no matter where the walk comes from (i.e., what the incoming
edge is) or which nodes the walk has visited, the outgoing edge
is always chosen uniformly at random from all edges attached to
the node. The key idea of CNRW is to replace such a memoryless
transition with a stateful process. Specifically, given the previous
transition of the random walk u → v, instead of selecting the next
node to visit by sampling with replacement from N(v), i.e., the
neighbors of v, we perform such sampling without replacement.

Such a change is demonstrated through an example in Figure 1.
When a CNRW transits through u → v for the first time, it selects
the next node to visit in the same way as traditional random walk -
i.e., by choosing w uniformly at random from N(v). Nonetheless,
if the random walk transits through u → v again in the future,
instead of selecting the next node from N(v), we limit the choice
to be from N(v) − {w}. One can see that, given a transition u → v,
our selection of the next node to visit is a process of sampling
without replacement from N(v). This, of course, continues until
∀w ∈ N(v), the random walk has passed through u → v → w,
at which time we reset the memory and restart the process of sampling
without replacement. Also, we introduce a notation b(u, v), which
is defined as the set of nodes in N(v) that the random walk has already
transited to right after passing through u → v (since the last reset).
Thus, we generalize the idea of CNRW as:

1. Each time when the random walk travels from u to v, we
uniformly choose the next candidate node w from N(v) −
b(u, v).

2. Let b(u, v) ← b(u, v) ∪ {w}. If b(u, v) = N(v), let b(u, v) = ∅.
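The two steps above can be sketched in a few lines of Python. This is a minimal illustration rather than the paper's own pseudo code: the class name `CNRW`, the `neighbors` callable, the `seed` parameter, and the treatment of the very first step (which has no incoming edge, so `None` is used as the previous node) are all assumptions. Every node is assumed to have at least one neighbor.

```python
import random
from collections import defaultdict


class CNRW:
    """Sketch of a Circulated Neighbors Random Walk."""

    def __init__(self, neighbors, seed=None):
        self.neighbors = neighbors      # callable: node -> iterable of neighbors
        self.rng = random.Random(seed)
        self.b = defaultdict(set)       # b(u, v): nodes already chosen after u -> v

    def next_node(self, u, v):
        nbrs = set(self.neighbors(v))
        used = self.b[(u, v)]
        # Step 1: choose w uniformly from N(v) - b(u, v).
        w = self.rng.choice(sorted(nbrs - used))
        # Step 2: record w; once every neighbor has been used, reset b(u, v).
        used.add(w)
        if used == nbrs:
            used.clear()
        return w

    def walk(self, start, steps):
        path = [start]
        prev, cur = None, start         # assumption: no incoming edge at the start
        for _ in range(steps):
            nxt = self.next_node(prev, cur)
            path.append(nxt)
            prev, cur = cur, nxt
        return path
```

Note that the memory is keyed by the incoming edge (u, v) rather than by the node v alone, matching the edge-based circulation described above; combining it with a cached neighbor lookup keeps the query cost equal to the number of distinct nodes visited.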


Intuitive Justification: To understand why CNRW improves the
efficiency of the sampling process, we start with an intuitive explanation
before presenting in the next subsection a rigorous theoretical
proof of CNRW’s superiority over the traditional simple random
walk (SRW) algorithm.

Intuitively, the justification for CNRW is quite straightforward:
If a random walk travels back to u → v after only a small number
of steps, it means that the choice of v → w is not a “good” one for
sampling because, ideally, we would like the random walk path to
“propagate” to all parts of the graph as quickly as possible, instead of
being stuck at a small, strongly connected subgraph like one that
includes u, v and w. As such, it is a natural choice for CNRW to
avoid following v → w again in the next transition, but to instead
choose a node from N(v) − {w} uniformly at random.

Figure 1: Demo of CNRW: it chooses the next candidate from the set N(v) − b(u, v). Panels: (b) N(v) − {w}; (c) N(v) − {w, q}; (d) N(v) − {w, q, u}; (e) N(v).

3.2 Theoretical Analysis 

In this section, we will first introduce the concept of path blocks
and then theoretically show that CNRW and SRW have the same
stationary distribution - with the fundamental difference between
the two being how path blocks alternate during the random walk
process. Finally, we prove that CNRW produces more accurate
samples (i.e., with smaller or equal asymptotic variance than SRW)
regardless of the graph topology.

Path blocks: Intuitively, path blocks divide a random walk’s path 
into consecutive segments. Formally, we have the definition of path 
blocks: 

Definition 4. [Path blocks]. Given a random walk over graph
G(V, E), denote its path as $X_0, X_1, \ldots, X_m, \ldots$, where $X_i \in V$.
A path block $B_{ij}$ is defined as

$$B_{ij} = \{X_i, X_{i+1}, \ldots, X_j\}, \quad j > i. \tag{5}$$

An interesting observation critical for the design of CNRW is
that for path block $B_{ij}$, $X_i$ can be the same as $X_j$, which is called
recurrence in a Markov chain. As a typical positive recurrent
Markov chain, SRW will always go back to the same node if the
number of steps is sufficiently large. Formally, positive recurrence
means that if we denote $M_i = \min\{n \ge 1 : X_n = i \mid X_0 = i\}$, $i \in V$,
then its expected value $E(M_i)$ is always finite. The reason why
recurrence is an important concept for CNRW is that, as one
can see from its design, the key difference between CNRW and SRW
is how we select the next transition when a recurrence happens.

As mentioned in the introduction, for the purpose of CNRW design,
one actually has two choices on how to define a recurrence -
based on edge-based and node-based path blocks, respectively:

$$\text{Edge-based: recurrence if } X_i = X_{j-1},\ X_{i+1} = X_j \tag{6}$$
$$\text{Node-based: recurrence if } X_i = X_j \tag{7}$$

In CNRW (and the following discussions in GNRW), we choose the
edge-based design for the following main reason: Note that with
the edge-based definition, path blocks separated by recurrences have
much longer expected length than with the node-based design, because
it takes many more steps for a random walk to travel back
to an edge than to a node. As such, edge-based path blocks tend to
have a similar distribution for each block, leading to a smaller
inter-path-block variance, which in turn results
in a more significant reduction of asymptotic variance brought by
CNRW’s without-replacement design (both properties are proved in
our theoretical analysis). We
also conducted extensive experiments to verify the superiority of
the edge-based design - though the results are not included in this paper
due to space limitations.

Figure 2: Path blocks in CNRW

Sampling Distribution: As discussed in Section 2, for SRW, the
sampling distribution is $\pi_j = k_j / (2|E|)$. CNRW, on the other hand,
is a higher-order Markov chain. As such, the sampling distribution
of CNRW is calculated as in Equation (1). We will introduce a
specific kind of path blocks first.

Definition 5. [Path block B(i) rooted on edge $e_{uv}$]. Given
a random walk with subsequence
u → v → i → $u_1$ → $u_2$ → ··· → $u_h$ → u → v, where i ∈ N(v), if there does not exist
j ∈ [1, h − 1] with $u_j = u$ and $u_{j+1} = v$, then we call the prefix of
this subsequence ending on $u_h$ - i.e., u → v → i → $u_1$ → $u_2$ →
··· → $u_h$ - an instance of path block B(i) rooted on $e_{uv}$.

What we really mean here is that once a random walk reaches
u → v for the first time, the remaining random walk can be partitioned
into consecutive, non-overlapping path blocks rooted on
edge $e_{uv}$, each of which starts with u → v and ends with the last
node visited by the random walk before visiting u → v again. Using
this definition, the following theorem indicates the equivalence
of sampling distributions of CNRW and SRW.
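The partitioning described above can be illustrated with a small helper that cuts a recorded path just before every later occurrence of u → v. The function name `split_into_path_blocks` and the list-based path representation are assumptions for this sketch; only blocks that are closed by a subsequent u → v are returned, so the unfinished tail of the walk is dropped.

```python
def split_into_path_blocks(path, u, v):
    """Partition the part of `path` after the first u -> v into
    consecutive path blocks rooted on the edge (u, v)."""
    # Positions i where the walk traverses u -> v (path[i] = u, path[i+1] = v).
    starts = [i for i in range(len(path) - 1)
              if path[i] == u and path[i + 1] == v]
    blocks = []
    for begin, end in zip(starts, starts[1:]):
        # Block runs from u -> v up to the node just before the next u -> v.
        blocks.append(path[begin:end])
    return blocks


walk = ["s", "u", "v", "i", "x", "u", "v", "j", "y", "u", "v"]
blocks = split_into_path_blocks(walk, "u", "v")
```

In this toy walk the two blocks are u → v → i → x and u → v → j → y, i.e., instances of B(i) and B(j); the steps before the first u → v do not belong to any block.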

Theorem 1. Given a graph G(V, E), CNRW has the stationary
distribution $\pi(v) = k_v / (2|E|)$.

Proof. We will construct the proof in two steps. 

• First, we prove that the stationary distribution is the same
(i.e., ∀s ∈ V, $\pi(s) = k_s / (2|E|)$, where $k_s$ is the degree of s,
as for simple random walk) after applying CNRW to any single
edge of the graph.

• Second, we prove that, ∀k > 1, if the stationary distribution
is the same after applying CNRW to any set of (k − 1) edges,
then the stationary distribution will remain the same after we
apply CNRW to any additional (k-th) edge in the graph.

Step 1. 
We start with the first step - i.e., applying CNRW over only
one edge, say u → v. We denote this CNRW as CNRW(1). Let
SRW be the original simple random walk. The most critical concept
for our proof is a path block B(i) for each neighbor of v, i.e.,
i ∈ N(v). According to Definition 5, B(i) represents a segment of the random
walk that starts with u → v → i and ends before the random
walk travels back through u → v the next time - e.g., if a segment
of random walk is u → v → i → ··· → x → u → v →
j → ··· → y → u → v (the two abbreviated parts in the middle
do not contain u → v), then this segment can be rewritten as
B(i) → B(j) → u → v, where B(i) = u → v → i → ··· → x
and B(j) = u → v → j → ··· → y (here x and y can be any
arbitrary node that has an edge linked to u).⁵

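Let $X^{CNRW(1)}$ and $X^{SRW}$ be a random walk performed by CNRW(1) and SRW, respectively. Let $X_m^{CNRW(1)}$ (resp. $X_m^{SRW}$) be the node accessed by $X^{CNRW(1)}$ (resp. $X^{SRW}$) at the $m$-th step. The sampling probabilities for any node $j \in V$ according to the stationary distributions of the random walks are

$$\pi^{CNRW(1)}(j) = \lim_{n \to \infty} \frac{1}{n} \sum_{m=1}^{n} \Pr\{X_m^{CNRW(1)} = j\} \qquad (8)$$

$$\pi^{SRW}(j) = \lim_{n \to \infty} \frac{1}{n} \sum_{m=1}^{n} \Pr\{X_m^{SRW} = j\} \qquad (9)$$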


We now consider $Y^{CNRW(1)}$ and $Y^{SRW}$, i.e., an infinite subsequence of $X^{CNRW(1)}$ (resp. $X^{SRW}$) which includes all steps occurring after the first visit of $u \to v$. There are two important observations here:

• First, both $Y^{CNRW(1)}$ and $Y^{SRW}$ can be completely separated into path blocks $B(i_1) \to B(i_2) \to B(i_3) \to \cdots$ where $\forall h$, $i_h \in N(v)$. The sole difference between CNRW(1) and SRW is in how a path block is chosen (i.e., CNRW(1) guarantees $i_1 \neq i_2 \neq \cdots \neq i_{|N(v)|}$ while SRW does not). The internal transitions within a path block, on the other hand, are indistinguishable for both approaches, i.e., for CNRW(1), once $B(i_1)$ is chosen for a slot, the internal transitions within $B(i_1)$ follow the exact same probability distribution as that of a $B(i_1)$ chosen for SRW.

• Second, since $Y^{CNRW(1)}$ (resp. $Y^{SRW}$) is equivalent to $X^{CNRW(1)}$ (resp. $X^{SRW}$) after the first appearance of $u \to v$, as long as $u \to v$ appears in the random walk there must be

$$\lim_{n \to \infty} \frac{1}{n} \sum_{m=1}^{n} \Pr\{X_m^{CNRW(1)} = j\} \qquad (10)$$

$$= \lim_{n \to \infty} \frac{1}{n} \sum_{m=1}^{n} \Pr\{Y_m^{CNRW(1)} = j\} \qquad (11)$$

$$\lim_{n \to \infty} \frac{1}{n} \sum_{m=1}^{n} \Pr\{X_m^{SRW} = j\} \qquad (12)$$

$$= \lim_{n \to \infty} \frac{1}{n} \sum_{m=1}^{n} \Pr\{Y_m^{SRW} = j\} \qquad (13)$$


Note that if $u \to v$ never appears in $X^{CNRW(1)}$, then this step is already proved: since CNRW(1) never kicks in, it of course yields the exact same stationary distribution as SRW.

Footnote: Note that, as further discussed in our response to D2, $B(i)$ and $B(j)$ are not overlapping on $u \to v$.

Note that, according to the design of CNRW, once we rewrite $Y^{CNRW(1)}$ to path blocks, i.e., $B(i_1) \to B(i_2) \to B(i_3) \to \cdots$, we will see that all $|N(v)|$ path blocks appear in an iterative fashion (e.g., no path block $B(i_x)$ appears twice before another $B(i_y)$ appears once). In other words,


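$$\lim_{n \to \infty} \frac{1}{n} \sum_{m=1}^{n} \Pr\{Y_m^{CNRW(1)} = j\} \qquad (14)$$

$$= \lim_{n' \to \infty} \frac{\sum_{i \in N(v)} \sum_{b=1}^{n'} \Pr\{B(i)_b = j\}}{|N(v)| \cdot n'} \qquad (15)$$

$$= \lim_{n \to \infty} \frac{\sum_{i \in N(v)} \sum_{b=1}^{n} \Pr\{B(i)_b = j\}}{|N(v)| \cdot n} \qquad (16)$$

where $B(i)_b$ is the node at Step $b$ of path block $B(i)$.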

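On the other hand, consider the rewrite of $Y^{SRW}$ to path blocks. In this case, the path blocks do not appear in an iterative fashion. Instead, each path block is generated i.i.d. uniformly from all $|N(v)|$ possible path blocks. Let $P(B(i))$ be the probability for $B(i)$ to be chosen at each slot; one can see that $P(B(i)) = 1/|N(v)|$. Thus,

$$\lim_{n \to \infty} \frac{1}{n} \sum_{m=1}^{n} \Pr\{Y_m^{SRW} = j\} \qquad (17)$$

$$= \lim_{n \to \infty} \sum_{i \in N(v)} P(B(i)) \cdot \frac{\sum_{b=1}^{n} \Pr\{B(i)_b = j\}}{n} \qquad (18)$$

$$= \lim_{n \to \infty} \frac{\sum_{i \in N(v)} \sum_{b=1}^{n} \Pr\{B(i)_b = j\}}{|N(v)| \cdot n} \qquad (19)$$

One can observe from (16) and (19) that

$$\lim_{n \to \infty} \frac{1}{n} \sum_{m=1}^{n} \Pr\{Y_m^{CNRW(1)} = j\} = \lim_{n \to \infty} \frac{1}{n} \sum_{m=1}^{n} \Pr\{Y_m^{SRW} = j\}. \qquad (20)$$

This, combined with the definitions in (8) and (9) and the equivalences in (10)-(13), leads to

$$\pi^{CNRW(1)}(j) = \pi^{SRW}(j), \qquad (21)$$

i.e., the stationary distribution achieved by CNRW(1) is exactly the same as that of SRW.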

Step 2. 

Now we need to prove the second step: if we assume that the stationary distribution is the same after applying CNRW to any set of $(k-1)$ edges, then the stationary distribution will remain the same after we apply CNRW to any additional $k$-th edge in the graph. We denote this CNRW as CNRW(k), and let $u \to v$ be the additional $k$-th edge. Our definition of path blocks $B(i)$ for $i \in N(v)$ remains exactly the same as in Step 1. Note that

$$\pi^{CNRW(k)}(j) = \lim_{n \to \infty} \frac{1}{n} \sum_{m=1}^{n} \Pr\{Y_m^{CNRW(k)} = j\} \qquad (22)$$

$$\pi^{CNRW(k-1)}(j) = \lim_{n \to \infty} \frac{1}{n} \sum_{m=1}^{n} \Pr\{Y_m^{CNRW(k-1)} = j\} \qquad (23)$$


The difference between CNRW(k) and CNRW(k-1) is only in how a path block is chosen. For CNRW(k), once $B(i_1)$ is chosen for a slot, the internal transitions within $B(i_1)$ follow the exact same probability distribution as that of a $B(i_1)$ chosen for CNRW(k-1). Therefore, according to the design of CNRW, once we rewrite CNRW(k) to path blocks, i.e., $B(i_1) B(i_2) B(i_3) \ldots$, we will get:

$$\lim_{n \to \infty} \frac{1}{n} \sum_{m=1}^{n} \Pr\{Y_m^{CNRW(k)} = j\} = \lim_{n \to \infty} \frac{\sum_{i \in N(v)} \sum_{b=1}^{n} \Pr\{B(i)_b = j\}}{|N(v)| \cdot n} \qquad (24)$$

Because CNRW(k-1) behaves exactly as SRW on the route $u \to v$ (since edge $e_{uv}$ is the additional $k$-th edge, to which only CNRW(k) applies the circulation), $B(i)$ is chosen uniformly at random in CNRW(k-1). Thus,

$$\lim_{n \to \infty} \frac{1}{n} \sum_{m=1}^{n} \Pr\{Y_m^{CNRW(k-1)} = j\} \qquad (25)$$

$$= \lim_{n \to \infty} \sum_{i \in N(v)} P(B(i)) \cdot \frac{\sum_{b=1}^{n} \Pr\{B(i)_b = j\}}{n} \qquad (26)$$

$$= \lim_{n \to \infty} \frac{\sum_{i \in N(v)} \sum_{b=1}^{n} \Pr\{B(i)_b = j\}}{|N(v)| \cdot n} \qquad (27)$$

We can combine equations (24) and (27), which leads to

$$\pi^{CNRW(k)}(j) = \pi^{CNRW(k-1)}(j). \qquad (28)$$

□

One can see from the theorem that, for both SRW and CNRW, the 
probability for a node to be sampled is proportional to its degree. 
As such, CNRW can be readily used as a drop-in replacement of 
SRW. 
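As a small sanity check of the target distribution named in Theorem 1, the following sketch verifies with exact rational arithmetic that $\pi(v) = k_v/(2|E|)$ is left unchanged by one simple-random-walk transition. The toy graph and the helper names `srw_stationary` and `srw_step` are illustrative assumptions, not part of the paper.

```python
from fractions import Fraction


def srw_stationary(adj):
    """Return pi(v) = k_v / (2|E|) for an undirected graph given as adjacency lists."""
    two_e = sum(len(nbrs) for nbrs in adj.values())  # sum of degrees equals 2|E|
    return {v: Fraction(len(nbrs), two_e) for v, nbrs in adj.items()}


def srw_step(dist, adj):
    """Apply one SRW transition: each node spreads its mass evenly over its neighbors."""
    out = {v: Fraction(0) for v in adj}
    for v, nbrs in adj.items():
        for w in nbrs:
            out[w] += dist[v] / len(nbrs)
    return out


# Hypothetical toy graph: a triangle (0, 1, 2) with a pendant node 3 attached to node 2.
toy = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
pi = srw_stationary(toy)
print(pi)
print(srw_step(pi, toy) == pi)
```

Because every directed edge carries mass $1/(2|E|)$ under $\pi$, each node receives exactly $k_v/(2|E|)$ back, which is the invariance the check confirms.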

Improvement on Sampling Efficiency: Having proved that CNRW 
and SRW produce the same (distribution of) samples - we now de¬ 
velop theoretical analysis on the efficiency of CNRW and its com¬ 
parison with SRW. Note that, since the sampling distributions of 
both CNRW and SRW converge asymptotically to the stationary 
distribution (i.e., probability proportional to a node’s degree), for a 
fair comparison between the efficiency of these two algorithms, we 
must first establish a measure of the distance between the sampling 
distribution achieved at $h$ steps and the (eventual) stationary distribution. The reason is that only when such a distance measure is defined can we compare the number of steps each algorithm requires to reduce the distance below a given threshold $\epsilon$; the one which requires fewer steps achieves better sampling efficiency.

For the purpose of this paper, we follow the commonly used asymptotic variance defined in Section 2. While there are multiple ways to understand the “physical meaning” of this measure, a simple one is that it measures the mean square error (MSE) one would get for an AVG aggregate (on the measure function $f(\cdot)$ used in the definition) by taking into account the nodes encountered in the first $h$ steps of a random walk. One can see that the lower this MSE is, the closer the sampling distribution at Step $h$ is to the stationary distribution.

With the asymptotic variance measure, we are now ready to compare the sampling efficiency of CNRW and SRW. The following theorem shows that for any measure function $f(\cdot)$ and any value of $h$ (i.e., the number of steps), CNRW always achieves a lower or equal variance, i.e., better or equal sampling efficiency, than SRW. Before presenting the rigorous proof, we first briefly discuss the intuition behind this proof as follows.

The proof is constructed through induction. As such, we first consider the impact of CNRW on changing the transition after one edge, i.e., as in the above running example, the transition after $u \to v$. A key concept used in the proof is a segmentation of a long random walk into segments according to $u \to v$. Specifically, every segment of the random walk except the first one starts and ends with $u \to v$. Figure 3 shows an example of this segmentation. We can see that CNRW generates segments containing alternating path blocks.

With the segmentation, a key idea of our proof is to introduce an encoding of each segment according to the first node it visits after $u \to v$. For example, in the case shown in Figure 3 where $v$ has three neighbors $\{u, w, q\}$, we have three possible path blocks: $\{B(u), B(w), B(q)\}$. It is important to note that, since we are only considering the change after $u \to v$ at this time, every path block can be considered as being drawn from the exact same distribution no matter whether CNRW or SRW is used, because CNRW does not make any changes once the first node after $u \to v$ (i.e., the code) is determined, until the next time $u \to v$ is visited. This observation enables us to simply consider CNRW and SRW as sequences of codes (i.e., we can map path blocks into codes) in the efficiency comparison, as what happens within each path block is oblivious to whether CNRW or SRW is being used.

Now studying the sequence of codes for CNRW and SRW, we make a straightforward yet interesting observation: given a sequence containing $\{B(u), B(w), B(q)\}$ with length $h$, the number of occurrences of $B(u)$, $B(w)$, and $B(q)$ will be the same (at least within a $\pm 1$ range) in CNRW, because they appear in turn every three codes. On the other hand, with SRW, the number of occurrences for $\{B(u), B(w), B(q)\}$ is equal in expectation but has an inherent variance in practice due to randomness. The elimination of this randomness/variance is exactly why CNRW tends to generate samples with a smaller variance than SRW, as rigorously shown in the following theorem.
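The occurrence-count observation above can be illustrated with a short simulation. This is only a sketch under stated assumptions: the helper names are hypothetical, each "code" stands for a whole path block, and circulation is modeled as drawing codes without replacement and refilling the pool once it is exhausted.

```python
import random
from collections import Counter


def srw_codes(codes, length, rng):
    """SRW-style sequence: each slot's code is drawn i.i.d. uniformly (with replacement)."""
    return [rng.choice(codes) for _ in range(length)]


def cnrw_codes(codes, length, rng):
    """CNRW-style sequence: codes are drawn without replacement; the pool refills when empty."""
    out, pool = [], []
    while len(out) < length:
        if not pool:
            pool = list(codes)
            rng.shuffle(pool)
        out.append(pool.pop())
    return out


def count_spread(seq, codes):
    """Difference between the most and least frequent code in the sequence."""
    counts = Counter(seq)
    values = [counts[c] for c in codes]
    return max(values) - min(values)


rng = random.Random(0)
codes = ["B(u)", "B(w)", "B(q)"]
cnrw_spread = count_spread(cnrw_codes(codes, 100, rng), codes)
srw_spread = count_spread(srw_codes(codes, 100, rng), codes)
print("CNRW spread:", cnrw_spread, "SRW spread:", srw_spread)
```

The circulated sequence keeps the counts within one of each other for any length, whereas the i.i.d. sequence typically shows a larger spread that varies from run to run.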

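Theorem 2. Given a graph $G(V, E)$, any property function $f$, and the following two estimators for $\mu$ based on SRW ($\hat{\mu}$) and CNRW ($\hat{\mu}'$):

$$\hat{\mu}_n = \frac{1}{n} \sum_{t=1}^{n} f(X_t), \qquad \hat{\mu}'_n = \frac{1}{n} \sum_{t=1}^{n} f(X'_t), \qquad (29)$$

then CNRW will have no greater asymptotic variance (defined in Section 2) than SRW:

$$V_\infty(\hat{\mu}') \le V_\infty(\hat{\mu}). \qquad (30)$$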


Proof. The proof of the asymptotic variance claim in Theorem 2 can be divided into 3 steps:

1. We reduce the problem to comparing the asymptotic variance when the transition matrices $P$ and $P'$ (for SRW and CNRW, respectively) differ only for transitions involving one node's neighbors. In other words, given nodes $u$, $v$, we only consider the circulated neighbors rule for CNRW upon $u \to v$, not for all the nodes in $G$.

2. We then divide the random walk's trace into path blocks. These blocks all start with the route $u \to v \to i$, $i \in N(v)$.

3. We see that blocks are equally likely, but may have different distributions for their contents. Blocks in CNRW are alternating, and they will have lower or equal asymptotic variance.

Step 1. Looking at one node's neighbors is enough. If we can prove for only one node's neighbors that CNRW achieves a smaller or equal asymptotic variance, we can recursively apply the argument to each node and its neighbors. Therefore, given a specific node $v \in V$ and one of its neighbors $u$, we only need to consider the circulated neighbors rule applied to $N(v)$ with the incoming route $u \to v$.

Step 2. For all the neighbors of $v$, i.e., $i \in N(v)$, we have path blocks $B(i)$ rooted on edge $e_{uv}$. For example, $B(w)$ and $B(q)$ are two blocks in Figure 3.

Step 3. We will show that the alternating appearance of the blocks $B(i)$ lowers (or at least does not increase) the asymptotic variance. We note that the blocks are equally likely, i.e., $P(B(i)) = P(B(j))$ for all $i, j \in N(v)$, because $P(u \to v \to i) = P(u \to v \to j)$. Also, if we denote the number of occurrences of block $B(i)$ among the first $M$ blocks as $K_{B(i)}(M)$, then

$$|K_{B(i)}(M) - K_{B(j)}(M)| \le 1, \quad \forall i, j \in N(v),\ M \ge 1 \qquad (31)$$

holds for CNRW but not SRW. The sampling for the blocks is therefore stratified in CNRW. According to a lemma from the literature, restated below, the asymptotic variance of CNRW is no more than that of SRW.

[Figure 3 panels. CNRW: B(w), B(u), B(q), B(w), B(u), B(q), B(w), ... Segments of alternating path blocks; they are circulated with a period of 3. SRW: B(u), B(w), B(q), ... Path blocks can be consecutive in SRW.]

Figure 3: Comparison of the block distribution in CNRW and SRW.

Proof. We denote the edge $e_{uw}$ ($u \in G_1$, $w \in G_2$) as the bridging edge in $G$. Now we only need to consider the transition probability of $u$. For SRW,

Lemma: Let $Z_1, Z_2, \ldots$ be an irreducible Markov chain with state space $\{0, 1, 2\}$, whose invariant distribution, $\rho$, satisfies $\rho(0) = \rho(1)$. Let $Q_z$ for $z = 0, 1, 2$ be distributions for pairs $(H, L) \in \mathbb{R} \times \mathbb{R}^+$ having finite second moments. Conditional on $Z_1, Z_2, \ldots$, let $(H_i, L_i)$ be drawn independently from $Q_{Z_i}$. Define

$$Z'_i = \begin{cases} Z_i & \text{if } Z_i = 2 \\ Z_k + \sum_{j=k}^{i-1} I_{\{0,1\}}(Z_j) \pmod{2} & \text{if } Z_i \neq 2 \end{cases} \qquad (32)$$

where $k = \min\{i : Z_i \neq 2\}$. (In other words, the $Z'_i$ are the same as the $Z_i$ except that the positions where 0 or 1 occurs have their values changed to a sequence of alternating 0s and 1s.) Conditional on $Z_1, Z_2, \ldots$, let $(H'_i, L'_i)$ be drawn independently from $Q_{Z'_i}$. Define two families of estimators as follows:

$$R_n = \frac{\sum_{i=1}^{n} H_i}{\sum_{i=1}^{n} L_i}, \qquad R'_n = \frac{\sum_{i=1}^{n} H'_i}{\sum_{i=1}^{n} L'_i} \qquad (33)$$

Then the asymptotic variance of $R'$ is no greater than that of $R$. In other words,

$$\lim_{n \to \infty} n \, Var(R'_n) \le \lim_{n \to \infty} n \, Var(R_n) \qquad (34)$$

This lemma justifies the claim that partial stratification of sampling for blocks cannot increase the asymptotic variance. It is easy to extend it to the case where the state space is $\{0, 1, 2, \ldots, k\}$, as long as we keep the stratification of the states. In applying this lemma to the proofs in our theorem, $Z_1, Z_2, \ldots$ are identifiers for the type of each block; we can use $0 = B(i_0)$, $1 = B(i_1)$, etc. Please refer to the original source of the lemma for more details. □


Theorem 2 establishes CNRW's superiority, while the following Theorem 3 shows how significant the superiority is over a concrete example: the barbell graph. The probability of propagating from one subgraph to the other in CNRW is much greater than that in SRW.


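$$P_{SRW} = P(u \to w) = \frac{1}{|N(u)|} = \frac{1}{|G_1|}, \qquad (36)$$

where $|G_1|$ is the number of nodes in $G_1$. For CNRW,

$$P_{CNRW} = \sum_{s} P(s \to u) \, P(u \to w \mid s \to u) \qquad (37)$$

$$\ge \frac{1}{|G_1| - 1} \sum_{x=1}^{|G_1| - 1} \frac{1}{x} \qquad (38)$$

$$> \frac{1}{|G_1| - 1} \int_{1}^{|N(u)|} \frac{1}{x} \, dx \qquad (39)$$

$$= \frac{\ln |G_1|}{|G_1| - 1} \qquad (40)$$

$$= \frac{|G_1|}{|G_1| - 1} \ln |G_1| \cdot P_{SRW}. \qquad (41)$$

Eq. (37) ⇒ (38) because the previously accessed neighbors are evenly distributed among $N(u)$. □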


3.3 Algorithm CNRW 


Algorithm 1 Circulated Neighbors Random Walk

/* Given x_0 = u, x_1 = v, b(x_0, x_1) = ∅ */
/* Function b(u, v) can be implemented as a HashMap, and all of its initial values are ∅. */
for i = 2 → samplesize do
    /* S denotes the next possible candidates */
    S ← N(x_{i-1}) − b(x_{i-2}, x_{i-1})
    if S ≠ ∅ then
        x_i ← uniformly choose a node from S
        b(x_{i-2}, x_{i-1}) = b(x_{i-2}, x_{i-1}) ∪ {x_i}
    else
        x_i ← uniformly choose a node from N(x_{i-1})
        b(x_{i-2}, x_{i-1}) = ∅
    end if
end for


Theorem 3. Given a barbell graph $G$, which contains two complete subgraphs $G_1$, $G_2$. If we choose the initial node $v \in G_1$, then the expected values of $t_{CNRW}$ and $t_{SRW}$ satisfy

$$\frac{P_{CNRW}}{P_{SRW}} > \frac{|G_1|}{|G_1| - 1} \ln |G_1| \qquad (35)$$

where $P_{CNRW}$ and $P_{SRW}$ are the probabilities for CNRW and SRW to travel from $G_1$ to $G_2$.
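To get a feel for the size of the bound in (35), the following sketch simply evaluates its right-hand side. The function name is hypothetical, and the choice $|G_1| = 50$ assumes a 100-node barbell graph split into two equal complete halves.

```python
import math


def barbell_lower_bound(g1_size):
    """Right-hand side of inequality (35): |G1| / (|G1| - 1) * ln|G1|."""
    if g1_size < 2:
        raise ValueError("a complete subgraph needs at least 2 nodes")
    return g1_size / (g1_size - 1) * math.log(g1_size)


# Assumed example: |G1| = 50, i.e., half of a 100-node barbell graph.
bound_50 = barbell_lower_bound(50)
print(round(bound_50, 4))
```

For this size the bound is just under 4, and it grows logarithmically with the size of the subgraph.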


Algorithm implementation. Algorithm 1 depicts the pseudo code for Algorithm CNRW. We note that the only data structure we maintain (beyond what is required for SRW) is a historic hash map of outgoing transitions $b(u, v)$ for each edge $e_{uv}$ we pass through, i.e., if the random walk has passed through edge $e_{uv}$ before, then $b(u, v)$ contains the neighbors of $v$ which have been chosen previously (during the walk) as the outgoing transitions from $v$.
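A minimal Python sketch of the circulation rule is given below, under some assumptions: the graph is available as in-memory adjacency lists (rather than behind a web/API interface), the function name `cnrw` is illustrative, and after a full round the history is cleared and the newly chosen node is recorded again. The last point is one reading of the rule "do not attempt the same neighbor twice before enumerating every neighbor"; the printed pseudo code instead leaves $b$ empty after a reset.

```python
import random
from collections import Counter


def cnrw(adj, start, steps, rng):
    """Circulated Neighbors Random Walk over adjacency lists `adj`.

    b[(u, v)] holds the neighbors of v already chosen after arriving via u -> v.
    """
    b = {}
    prev = start
    cur = rng.choice(adj[start])  # the first move is an ordinary SRW step
    walk = [prev, cur]
    for _ in range(steps):
        used = b.setdefault((prev, cur), set())
        candidates = [w for w in adj[cur] if w not in used]
        if not candidates:  # every neighbor was already used: start a new round
            used.clear()
            candidates = list(adj[cur])
        nxt = rng.choice(candidates)
        used.add(nxt)
        walk.append(nxt)
        prev, cur = cur, nxt
    return walk


# Hypothetical toy graph: a triangle (0, 1, 2) with a pendant node 3 attached to node 2.
toy = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
walk = cnrw(toy, 0, 60000, random.Random(1))
freq = Counter(walk)
print({v: round(freq[v] / len(walk), 3) for v in sorted(toy)})
```

On this toy graph the visit frequencies should come out close to the degree-proportional values 2/8, 2/8, 3/8 and 1/8. The dictionary keyed by `(prev, cur)` plays the role of the hash map $b(u, v)$, so memory grows only with the number of distinct edges traversed.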
Figure 4: An example of partitioning a node's neighbors into 3 groups.


than SRW. This is formalized in the following theorem, the proof of which can be constructed in analogy to Theorem 2.

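Theorem 4. Given a graph $G(V, E)$, GNRW has the stationary distribution $\pi(v) = k_v / (2|E|)$. Moreover, for any property function $f$ and the following two estimators for $\mu$ based on SRW ($\hat{\mu}$) and GNRW ($\hat{\mu}^*$):

$$\hat{\mu}_n = \frac{1}{n} \sum_{t=1}^{n} f(X_t), \qquad \hat{\mu}^*_n = \frac{1}{n} \sum_{t=1}^{n} f(X^*_t), \qquad (42)$$

GNRW will have no greater asymptotic variance than SRW:

$$V_\infty(\hat{\mu}^*) \le V_\infty(\hat{\mu}). \qquad (43)$$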


Time and space complexity. CNRW requires a hash map that continuously records the outgoing transitions for each edge, i.e., the key-value pair $e_{uv} \to b(u, v)$. Assume CNRW walks $K$ steps and the keys are uniformly distributed among all the edges in the graph for CNRW (the same as SRW); then this hash map as a whole occupies expected $O(K)$ space (if we are using dynamic perfect hashing). Also, each step's expected amortized time complexity is $O(1)$, so the total expected time complexity for $K$ steps is $O(K)$.

4. GNRW: GROUPBY NEIGHBORS 
RANDOM WALK 

4.1 Basic Idea 

Recall from Section 3 the main idea of CNRW: given an incoming transition $u \to v$, we essentially circulate the next transition among the neighbors of $v$, ensuring that we do not attempt the same neighbor twice before enumerating every neighbor. The key idea of GNRW is a natural extension: instead of performing the circulation at the granularity of each neighbor (of $v$), we propose to first stratify the neighbors of $v$ into groups, and then circulate the selection among all groups. In Figure 4, for example, if we visit $u \to v$ for the second time, with the last chosen transition being from $v$ to a node in $S_2$, then this time we will randomly pick one group from $S_1$ and $S_3$ (with probability proportional to the number of not-yet-attempted transitions in each group), and then pick a node from the chosen group uniformly at random. GNRW can be summarized as:

1. It has a global groupby function $g(\cdot)$ that will partition a node's neighbors into disjoint groups. For example, $g(N(v)) = \{S_1, S_2, \ldots, S_m\}$.

2. Each time the random walk travels from $u$ to $v$, we choose the next candidate group $S_i$ from $N(v) - S(u, v)$ with probability $|S_i| / |N(v) - S(u, v)|$, where $S(u, v)$ is the set of groups we have accessed before.

3. Within $S_i$, we uniformly at random choose the next candidate node $w$ from $S_i - b_{S_i}(u, v)$, where $b_{S_i}(u, v)$ is defined similarly to CNRW's, with a range limited to $S_i$.

4. Let $b(u, v) \leftarrow b(u, v) \cup \{w\}$, and $S(u, v) \leftarrow S(u, v) \cup \{S_i\}$. If $b(u, v) = N(v)$, let $b(u, v) = \emptyset$. If $S(u, v)$ contains all the possible groups, let $S(u, v) = \emptyset$.

One can see from the design of GNRW that, like CNRW, it 
does not alter the stationary distribution of simple random walk 
- no matter how the grouping strategy is designed. Also, just like 
CNRW, GNRW guarantees a smaller or equal asymptotic variance 


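Proof.

$$\pi^{GNRW}(j) = \lim_{n \to \infty} \frac{1}{n} \sum_{m=1}^{n} \Pr\{X_m^{GNRW} = j\} \qquad (44)$$

$$= \lim_{n \to \infty} \frac{1}{n} \sum_{m=1}^{n} \Pr\{Y_m^{GNRW} = j\} \qquad (45)$$

$$= \lim_{n \to \infty} \frac{1}{n} \sum_{m=1}^{n} \Pr\{Y_m^{SRW} = j\} \qquad (46)$$

$$= \lim_{n \to \infty} \frac{1}{n} \sum_{m=1}^{n} \Pr\{X_m^{SRW} = j\} \qquad (47)$$

$$= \pi^{SRW}(j) \qquad (48)$$

Eq. (45) ⇒ (46) because GNRW will iterate over all the path blocks in $N(v) = \bigcup_i S_i$, and each path block has the same probability of being accessed ($S_i$ is chosen with probability proportional to its size). The rest of the proof is straightforward and similar to CNRW's. □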


In the following discussions, we first describe the rationale behind GNRW, and then discuss the design of the grouping strategy (for the neighbors of $v$), in other words, the design of the groupby function $g(N(v)) = \{S_1, S_2, \ldots, S_m\}$. To understand the rationale, we start by considering two extremes of the grouping-strategy design. At one extreme is to group the neighbors of $v$ in a completely random fashion. With this strategy, GNRW is exactly reduced to CNRW, i.e., every neighbor of $v$ is circulated, with the order being a random permutation. At the other extreme is the ideal scenario for GNRW, as illustrated in Figure 4, when nodes leading to similar path blocks are grouped together. The intuitive reason why GNRW outperforms CNRW in this case is easy to understand from Figure 4: circulating among the three groups, instead of attempting a group more than once, can make the random walk “propagate” faster to the entire graph rather than being “stuck” in just one cluster.

More formally, the advantage offered by GNRW can be explained according to the path-block encoding scheme introduced in Section 3.2. Consider an example in Figure 5 where $v$ has 4 neighbors. We encode the path block from each neighbor as random variables $B(w_1), \ldots, B(w_4)$. Suppose that the neighbors are partitioned into two groups: $S_1 = \{w_1, w_2\}$ and $S_2 = \{w_3, w_4\}$. One can see that GNRW is like stratified sampling, while CNRW is a process of simple random sampling (without replacement). As long as the intra-group variance for the two groups is smaller than the population variance, GNRW offers overall a lower asymptotic variance. To understand why, consider an extreme-case scenario with zero intra-group variance (i.e., $B(w_1) = B(w_2)$, $B(w_3) = B(w_4)$). One can see that, while GNRW achieves zero overall variance, CNRW still has positive variance due to the inter-group variance (i.e., $B(w_1) \neq B(w_3)$).

[Figure 5 panel: segments of alternating groups and path blocks. They are circulated with a period of 4.]

Figure 5: A demo of GNRW
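The zero intra-group variance scenario can be worked out numerically. The sketch below is an illustration under assumptions: each path block is summarized by a single made-up number, and the comparison looks at the average over two consecutive slots, enumerating every equally likely outcome instead of simulating.

```python
from itertools import combinations
from statistics import mean, pvariance

# Hypothetical block values with zero intra-group variance:
# B(w1) = B(w2) = 1.0 and B(w3) = B(w4) = 5.0.
values = {"w1": 1.0, "w2": 1.0, "w3": 5.0, "w4": 5.0}
groups = [("w1", "w2"), ("w3", "w4")]

# Simple random sampling without replacement (CNRW-like): every pair is equally likely.
srs_means = [mean(values[a] for a in pair) for pair in combinations(values, 2)]

# Stratified sampling (GNRW-like): one neighbor from each group.
strat_means = [mean((values[a], values[b])) for a in groups[0] for b in groups[1]]

srs_var = pvariance(srs_means)
strat_var = pvariance(strat_means)
print("simple random sampling variance:", srs_var)
print("stratified sampling variance:", strat_var)
```

Every stratified pair averages to 3.0, so its variance is zero, while the unstratified pairs average to 1.0, 3.0 or 5.0 and therefore retain the inter-group variance.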

Having discussed the rationale behind GNRW, we now turn our attention to the design of the grouping strategy. One can see from the above discussion that the main objective here is to group together neighbors that lead to similar path blocks, i.e., random walks starting from nodes in the same group should share similar characteristics. We would like to make two observations for the design: First, locality is a property widely recognized for social
networks - i.e., users with similar attribute values (e.g., age, occu¬ 
pation) tend to have similar friends (and therefore lead to similar 
path blocks). Thus, grouping neighbors of v based on any attribute 
is likely to outperform the baseline of random group assignments. 

Second, which attribute to use for group assignments also has 
an implication on the potential usage of samples for analytical pur¬ 
poses. For example, if one knows beforehand that samples taken 
from a social network will be used to estimate the average age of 
all users, then designing the grouping strategy based on user age is 
an ideal design that will likely lead to a more accurate estimation 
of the average age. To understand why, note that with this group¬ 
ing strategy, the random walk is likely to quickly “propagate” to 
users of different age groups, instead of being “stuck” at a tight 
community formed by users of similar ages. Thus, if one knows an 
important aggregate query which the collected samples will be used 
to estimate, then the grouping strategy should be designed according to the attribute being aggregated in the query. We shall verify this intuition with experimental results in Section 6.

4.2 Algorithm GNRW 

Algorithm implementation. Algorithm 2 depicts the pseudo code for Algorithm GNRW. We note that the data structures we maintain are two hash maps: $S(u, v)$ and $b_{S_i}(u, v)$. $S(u, v)$ is a mapping from $(u, v)$ to the current set of groups that GNRW has accessed before based on the route $u \to v$; $b_{S_i}(u, v)$ is a mapping from $(u, v, S_i)$ to the current set of nodes that GNRW has accessed before based on the route $u \to v$ and the outgoing group $S_i$.

Time and space complexity. GNRW requires two hash maps that continuously record the outgoing groups and edges for each edge, i.e., the key-value pairs $e_{uv} \to S(u, v)$ and $(e_{uv}, S_i) \to b_{S_i}(u, v)$.

Similar to CNRW, we assume GNRW walks $K$ steps; it also has expected amortized $O(K)$ time complexity and $O(K)$ space complexity, because the keys $e_{uv}$ and $(e_{uv}, S_i)$ are uniformly distributed among their possible values.
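The two hash maps can be sketched in Python as follows. This is an illustration under assumptions rather than the authors' implementation: the graph is given as in-memory adjacency lists, `group_of` stands in for the groupby function $g(\cdot)$, groups are weighted by their size among the not-yet-attempted groups, and histories are cleared and restarted once exhausted.

```python
import random


def gnrw(adj, group_of, start, steps, rng):
    """Groupby Neighbors Random Walk sketch.

    group_of(node) returns a hashable group label (the groupby function g).
    """
    used_groups = {}  # S(u, v): groups already chosen after arriving via u -> v
    used_nodes = {}   # b_{S_i}(u, v): nodes already chosen inside each group
    prev, cur = start, rng.choice(adj[start])
    walk = [prev, cur]
    for _ in range(steps):
        groups = {}
        for w in adj[cur]:
            groups.setdefault(group_of(w), []).append(w)
        seen = used_groups.setdefault((prev, cur), set())
        remaining = [g for g in groups if g not in seen]
        if not remaining:  # all groups attempted: start a new round
            seen.clear()
            remaining = list(groups)
        # Pick a group with probability proportional to its size.
        weights = [len(groups[g]) for g in remaining]
        label = rng.choices(remaining, weights=weights)[0]
        seen.add(label)
        taken = used_nodes.setdefault((prev, cur, label), set())
        candidates = [w for w in groups[label] if w not in taken]
        if not candidates:  # group exhausted: reset its history
            taken.clear()
            candidates = list(groups[label])
        nxt = rng.choice(candidates)
        taken.add(nxt)
        walk.append(nxt)
        prev, cur = cur, nxt
    return walk


# Hypothetical example: a complete graph on 5 nodes, grouped by parity of the node id.
k5 = {v: [w for w in range(5) if w != v] for v in range(5)}
demo_walk = gnrw(k5, lambda w: w % 2, 0, 5000, random.Random(7))
print(demo_walk[:12])
```

In practice `group_of` would inspect a node attribute (for example an age bucket), and the two dictionaries grow only with the number of distinct incoming edges and groups actually visited.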

5. DISCUSSIONS 

CNRW applied to Non-Backtracking Random Walk (NB-SRW)

It is important to note that the idea of CNRW, i.e., changing the transition upon visiting an edge $u \to v$ from sampling with replacement to sampling without replacement, is an idea that can be applied to any base random walk algorithm, including both SRW and NB-SRW [11]. For example, if we apply the idea to NB-SRW, then the resulting algorithm (say NB-CNRW) will work as follows: upon visiting $u \to v$, instead of sampling the next node with replacement from $N(v) \setminus \{u\}$ (as in NB-SRW), we would sample it without replacement from $N(v) \setminus \{u\}$. Note the difference between NB-CNRW and the CNRW algorithm presented in the paper (which is based on SRW): with CNRW, the sampling is done over $N(v)$, while with NB-CNRW, the sampling is done over $N(v) \setminus \{u\}$, a change carried over from NB-SRW.


Algorithm 2 Groupby Neighbors Random Walk

/* Given x_0 = u, x_1 = v, a groupby function g(·) */
/* And we assume that all S(u, v) and b_{S_i}(u, v) are initialized as ∅. */
for i = 2 → samplesize do
    g(N(x_{i-1})) = {S_1, S_2, ..., S_m}
    CS ← {S_1, S_2, ..., S_m} − S(x_{i-2}, x_{i-1})
    if CS ≠ ∅ then
        S_i ← choose a group with probability |S_i| / |CS|
        U ← S_i − b_{S_i}(x_{i-2}, x_{i-1})
        if U ≠ ∅ then
            x_i ← uniformly choose a node from U
            b_{S_i}(x_{i-2}, x_{i-1}) = b_{S_i}(x_{i-2}, x_{i-1}) ∪ {x_i}
        else
            x_i ← uniformly choose a node from S_i
            b_{S_i}(x_{i-2}, x_{i-1}) = ∅
        end if
    else
        S_i ← uniformly choose a group from {S_1, S_2, ..., S_m}
        U ← S_i − b_{S_i}(x_{i-2}, x_{i-1})
        if U ≠ ∅ then
            x_i ← uniformly choose a node from U
            b_{S_i}(x_{i-2}, x_{i-1}) = b_{S_i}(x_{i-2}, x_{i-1}) ∪ {x_i}
        else
            x_i ← uniformly choose a node from S_i
            b_{S_i}(x_{i-2}, x_{i-1}) = ∅
        end if
    end if
end for

How graph size affects CNRW and GNRW. First, we note that the graph size is unlikely to be a main factor in the historic visit probability. To understand why, consider SRW over an undirected graph and the probability of going back to the starting node (which, without loss of generality, is the probability for a historic visit to occur). Note that the probability of going back to the starting node at Step $i$ keeps decreasing with $i$, until the random walk visits the starting node again. Thus, the historic visit probability mainly depends on the first $k$ steps after visiting a node, where $k$ is a small constant. In more intuitive terms, a random walk is most likely to go back to a node only a few steps after visiting the node (because, after only a few steps, the random walk is likely still within a “tightly connected” local neighborhood of the node). This essentially means that the historic visit probability is unlikely to be sensitive to graph size; after all, even when the graph size tends to infinity, the probability of visiting (or re-visiting) a node within a constant number of steps is unlikely to change much. As an extreme-case example, growing the graph beyond the $k$-hop neighborhood of the starting node does not change the historic visit probability within $k$ steps at all.

6. EXPERIMENTS 
6.1 Experimental Setup 

Hardware and platform: We conducted all experiments on a com¬ 
puter with Intel Core i3 2.27GHz CPU and 64bit Ubuntu Linux OS. 
Datasets: We tested three types of datasets in the experiments: 
well-known public benchmark datasets that are small subsets of 
real-world social networks, large online social networks such as 
Google Plus and Yelp, and synthetic graphs (for demonstrating 
extreme-case scenarios) - e.g., barbell graphs and small clustered 
graphs. We briefly describe the three types of datasets we used as follows (see the summary of these datasets in Table 1).

Public Benchmark: 

The Facebook dataset is a public benchmark dataset collected from [1]. It is a previously captured topological snapshot of Facebook and it has been extensively used in the literature (e.g., [14]). Specifically, the graph we used is from the “1684.edges” file. Youtube is another large public benchmark graph collected from [21]. For these public benchmark datasets, we simulated a restricted-access web interface precisely according to the definition in Section 2.1 and ran our algorithms over the simulated interface.

Large Online Social Graphs: 

Google Plus: To test the scalability of our algorithms over a large graph, we performed experiments over a large graph we crawled from Google Plus that consists of 240,276 users. We observe that the interface provided by the Google Social Graph API strictly adheres to our access model discussed in Section 2.1, i.e., each API request returns the local neighborhood of one user.

Yelp dataset: We extracted the largest connected subgraph containing 119,839 users (out of 252,898 users) from the dataset. We restored all the dumped JSON data into MongoDB to simulate API requests.

Since we focus on sampling undirected graphs in this paper, for datasets that feature directed graphs, we first converted each of them to an undirected one by only keeping edges that appear in both directions in the original graph. Note that by following this conversion strategy, we guarantee that a random walk over the undirected graph can also be performed over the original directed graph, with an additional step of verifying the existence of the inverse direction (i.e., $v \to u$) before committing to an edge (i.e., $u \to v$) in the random walk.
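The conversion rule described above (keep an edge only if it appears in both directions) can be sketched as follows. The function name and the edge-list input format are assumptions made for illustration.

```python
def to_undirected_mutual(directed_edges):
    """Keep {u, v} only when both u -> v and v -> u appear in the directed graph."""
    edge_set = set(directed_edges)
    undirected = {}
    for u, v in edge_set:
        if u != v and (v, u) in edge_set:
            undirected.setdefault(u, set()).add(v)
            undirected.setdefault(v, set()).add(u)
    return undirected


# Hypothetical directed edge list: 1 <-> 2 and 3 <-> 4 are mutual, 2 -> 3 is one-way.
example_edges = [(1, 2), (2, 1), (2, 3), (3, 4), (4, 3)]
print(to_undirected_mutual(example_edges))
```

Nodes that have no mutual edge simply do not appear in the result, so a connected-component extraction step may still be needed afterwards.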

Synthetic Graphs: 

We also tested our algorithms over synthetic graphs, such as barbell graphs and graphs with high clustering coefficients, for two main purposes. One is to demonstrate the performance of our algorithms over “ill-formed” graphs, as these synthetic graphs have very small conductance (i.e., burning in is highly costly). The other is to control graph parameters, such as the number of nodes, that we cannot directly control over the above-described real-world graphs. It is important to note that our usage of a theoretical graph generation model does not indicate a belief that the model is a representation of real-world social network topology.

Google Plus: https://plus.google.com/

Yelp dataset: http://www.yelp.com/dataset_challenge

Algorithms: We implemented and tested five algorithms in the experiments: Simple Random Walk (SRW), Metropolis-Hastings Random Walk (MHRW), Non-Backtracking Simple Random Walk (NB-SRW) [11], a state-of-the-art random walk algorithm which uses an order-2 Markov chain, and the two algorithms proposed in this paper: Circulated Neighbors Random Walk (CNRW) in Section 3 and Groupby Neighbors Random Walk (GNRW) in Section 4. For each algorithm, we ran it with a query budget ranging from 20 to 1000, and took the returned sample nodes to measure their quality (see the performance measures described below). For GNRW, we tested various grouping strategies, as elaborated in Section 6.2.

It is important to understand why we included MHRW in the al¬ 
gorithms for testing. Note that, while SRW, NB-SRW and our two 
algorithms all share the same (target) sampling distribution - i.e., 
each node is sampled with probability proportional to its degree - 
MHRW has a different sampling distribution, i.e., the uniform distribution. Thus, it is impossible to compare the samples returned by MHRW with those of the other algorithms directly. We note that our purpose of including MHRW here is simply to verify what has been recently shown in the literature (e.g., [11]), i.e., for practical purposes such as aggregate estimation over social networks, the performance of MHRW is much worse than that of the other SRW-based algorithms, justifying the usage of SRW as a baseline in our design.

Performance Measures. Recall from Section 2.3 that a sampling algorithm for online social networks should be measured by query cost and bias, i.e., the distance between the actual sampling distribution and the (ideal) target one, which in our case is $\pi(v) = k_v / (2|E|)$. To measure the query cost, one simply counts the number of unique queries issued by the sampler. The measurement of bias, on the other hand, requires us to consider two different methods (and three measures) described as follows.

For a small graph, we measured bias by running the sampler for 
an extremely long amount of time (long enough so that each node 
is sampled multiple times). We then estimated the sampling dis¬ 
tribution by counting the number of times each node is retrieved, 
and compared this distribution with the target distribution to de¬ 
rive the bias. Their distances are measured in two forms: (1) KL- 
divergence j^, and ( 2 ) f 2 -distance (^ between the two distribution 
vectors. Let P and Psam be the ideal and measured sampling dis¬ 
tribution vectors, respectively. 

• To measure the distance in KL-divergence, we compute 

PKL(P||Psam) + PKL(Psam||P) whete 

Dkl{P\\Q) = E In (§Sy) 

• For ^ 2 -norm, we use IIP — Paam1 12 - 

Note that, compared with the KL-divergence based measure, the
ℓ2-norm one is more sensitive to “outliers” - i.e., large differences
in the sampling probability for a single node - hence our usage of
both measures in the experiments.
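The two distance measures above can be computed as in the following
sketch. The visit counts are made-up numbers for a four-node toy
graph, not experimental data, and the function names are ours. The
code assumes every node has been sampled at least once, so that no
probability is zero, which matches the "sampled multiple times"
condition stated above.

```python
import math


def kl_divergence(p, q):
    """D_KL(P || Q) = sum_i P(i) * ln(P(i) / Q(i)), natural logarithm."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def symmetric_kl(p, p_sam):
    """D_KL(P || P_sam) + D_KL(P_sam || P)."""
    return kl_divergence(p, p_sam) + kl_divergence(p_sam, p)


def l2_distance(p, p_sam):
    """Euclidean norm of P - P_sam."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, p_sam)))


def empirical_distribution(visit_counts):
    """Turn per-node visit counts into a probability vector."""
    total = sum(visit_counts)
    return [c / total for c in visit_counts]


# Toy example: node degrees 2, 2, 3, 1, so 2|E| = 8.
degrees = [2, 2, 3, 1]
target = [k / sum(degrees) for k in degrees]             # k_v / (2|E|)
measured = empirical_distribution([260, 240, 370, 130])  # hypothetical counts

print(symmetric_kl(target, measured))
print(l2_distance(target, measured))
```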

Dataset            Nodes     Edges      Average   Average clustering   Number of
                                        degree    coefficient          triangles
Facebook           775       14006      36.14     0.47                 954116
Google Plus        240276    30751120   255.96    0.51                 2576826580
Yelp               119839    954116     15.92     0.12                 4399166
Youtube            1134890   2987624    5.26      0.08                 3056386
Clustering graph   90        1707       37.93     0.99                 23780
Barbell graph      100       2451       49.02     0.99                 39200

Table 1: Summary of the datasets in the experiments.

Figure 6: Large Google Plus graph: estimation of average degree.

For large graphs such as the Google Plus and Yelp datasets we used,
it is no longer feasible to directly measure the sampling probability
distribution (because sampling each node multiple times to obtain a
reliable estimation becomes prohibitively expensive). Thus, we
considered another measure of bias, aggregate estimation error, for
experiments over large graphs. Specifically, we first used the
collected samples to estimate an aggregate over all nodes in the
graph - e.g., the average degree or reviews count - and then compared
the estimation with the ground truth. One can see that, since SRW,
NB-SRW and both of our algorithms all share the exact same target
sampling distribution, a sampler with a smaller bias tends to
produce an estimation with lower relative error - justifying our usage
of it in the experiments.
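A minimal sketch of the aggregate estimation error follows. The text
does not spell out the estimator, so the code assumes the standard
re-weighting for degree-proportional samples: each sample is weighted
by the inverse of its degree and the weights are normalized. The
function names and the toy data are ours.

```python
def estimate_average(samples, degree, value):
    """Estimate the average of value(v) over all nodes from samples that
    were drawn with probability proportional to degree(v).

    Assumed estimator: self-normalized weights 1 / degree(v)."""
    weights = [1.0 / degree(v) for v in samples]
    weighted_sum = sum(w * value(v) for w, v in zip(weights, samples))
    return weighted_sum / sum(weights)


def relative_error(estimate, truth):
    """|estimate - truth| / |truth|."""
    return abs(estimate - truth) / abs(truth)


# Toy graph with degrees 2, 2, 3, 1; each node appears in the sample
# exactly as often as its degree, i.e. an idealized unbiased walk.
toy_degree = {0: 2, 1: 2, 2: 3, 3: 1}
ideal_samples = [0, 0, 1, 1, 2, 2, 2, 3]

true_avg_degree = sum(toy_degree.values()) / len(toy_degree)
estimated = estimate_average(ideal_samples, toy_degree.get, toy_degree.get)
print(estimated, relative_error(estimated, true_avg_degree))
```

Passing a different `value` function (for instance a reviews-count
lookup) estimates a different aggregate from the same samples.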

6.2 Evaluation 

We start with the experiments that show that SRW, CNRW and
GNRW have the same stationary distribution. In Figure 8, we ran
100 instances of each random walk for 10000 steps on two datasets
(both from the Facebook dataset (TJ)), and then used the samples
collected from each random walk to calculate the sampling
distribution. We ordered the nodes by their degree, and we also
included the theoretical distribution of all the nodes (i.e., the
red solid line in the figure). One can see that all
three random walks converge to the same stationary distribution.
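The kind of check described above can be reproduced on a toy graph
with the following sketch. CNRW is defined earlier in the paper; the
`cnrw` function here is only our simplified reading of its
sampling-without-replacement transition (memory keyed by the last
traversed edge) and should be treated as an assumption, as should the
toy graph, the step count and the seed.

```python
import random
from collections import Counter, defaultdict


def srw(graph, start, steps, rng):
    """Simple random walk: move to a uniformly random neighbor."""
    visits, current = Counter(), start
    for _ in range(steps):
        current = rng.choice(graph[current])
        visits[current] += 1
    return visits


def cnrw(graph, start, steps, rng):
    """Sketch of a circulated-neighbors walk: after arriving at v from u,
    choose the next node without replacement among the neighbors of v."""
    remaining = defaultdict(list)  # (u, v) -> neighbors of v not yet used
    visits = Counter()
    prev, current = None, start
    for _ in range(steps):
        pool = remaining[(prev, current)]
        if not pool:               # every neighbor was used: start a new round
            pool.extend(graph[current])
        nxt = pool.pop(rng.randrange(len(pool)))
        prev, current = current, nxt
        visits[current] += 1
    return visits


def max_deviation(visits, graph):
    """Largest gap between visit frequencies and the target k_v / (2|E|)."""
    total_visits = sum(visits.values())
    total_degree = sum(len(nbrs) for nbrs in graph.values())  # equals 2|E|
    return max(abs(visits[v] / total_visits - len(graph[v]) / total_degree)
               for v in graph)


toy_graph = {0: [1, 2, 3], 1: [0, 2], 2: [0, 1, 3], 3: [0, 2, 4], 4: [3]}
rng = random.Random(0)
srw_dev = max_deviation(srw(toy_graph, 0, 200_000, rng), toy_graph)
cnrw_dev = max_deviation(cnrw(toy_graph, 0, 200_000, rng), toy_graph)
print(srw_dev, cnrw_dev)
```

Both deviations should be small, reflecting that the two walks target
the same degree-proportional distribution.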

We then compared all five algorithms over the Google Plus
dataset, with the bias measure being the relative error for
estimating the average degree. Figure 6 depicts the change of
relative error with query cost. One can make two observations from
the figure. First, our proposed algorithms CNRW and GNRW
significantly outperform the other algorithms. For example, to
achieve a relative error of 0.06, CNRW and GNRW require a query
cost of only around 486 and 447, respectively, while SRW requires
over 800, NB-SRW requires 795, and MHRW never achieves a relative
error under 0.08 after issuing 1000 queries. Second, we can also
observe from the figure that MHRW performs much worse than
the other algorithms. Thus, we do not include MHRW in the
remaining experimental results in the paper.

With Figure 6 establishing the superiority of our algorithms over
the existing ones under the relative error measure, we also confirmed
the superiority with the other two measures, KL-divergence and
ℓ2-norm, this time over the public benchmark datasets. Figure 7
depicts the results for Facebook and Youtube. One can observe
from Figure 7 that our CNRW and GNRW algorithms consistently
outperform both SRW and NB-SRW according to all three measures
being tested. In addition, GNRW outperforms CNRW, also for all
three measures.

Figure 9: Yelp dataset: GNRW strategies. Panels: (a) Estimate
Average Degree; (b) Estimate Average Reviews Count.

To further study the design of GNRW, specifically the criteria
for grouping nodes together, we tested GNRW with three different
grouping strategies: random grouping (i.e., GNRW-By-MD5, as
we group nodes together according to the MD5 of their IDs),
grouping by similar degrees (GNRW-By-Degree), and grouping by the
value of an attribute “reviews count” (GNRW-By-ReviewsCount).
Figure 9 depicts the performance of all three strategies. One can
make an interesting observation from the figure: while all three
variations significantly outperform the baseline SRW algorithm,
the best-performing variation differs when the relative error
is computed from different aggregates. Specifically, when the
aggregate is average degree, GNRW-By-Degree performs the best.
When the aggregate is average reviews count, on the other hand,
GNRW-By-ReviewsCount performs the best. This verifies our
discussions in Section [ 44 ] - i.e., if the aggregate of interest
is known beforehand, choosing the grouping strategy in alignment
with the aggregate of interest can lead to more accurate aggregate
estimations from samples.
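The three grouping strategies differ only in the key used to assign
a neighbor to a group, which the sketch below makes explicit. Only
the grouping step is shown; how GNRW uses the groups in its
transitions is defined earlier in the paper. The number of groups,
the bucket boundaries and the toy attribute values are assumptions
made for illustration.

```python
import bisect
import hashlib
from collections import defaultdict


def group_by_md5(node_id, num_groups):
    """Pseudo-random grouping from the MD5 digest of the node ID."""
    digest = hashlib.md5(str(node_id).encode("utf-8")).hexdigest()
    return int(digest, 16) % num_groups


def group_by_value(value, boundaries):
    """Bucket a numeric attribute (degree, reviews count) using sorted
    boundaries; values in the same bucket are treated as similar."""
    return bisect.bisect_right(boundaries, value)


def partition_neighbors(neighbors, key):
    """Split a neighbor list into groups according to a grouping key."""
    groups = defaultdict(list)
    for v in neighbors:
        groups[key(v)].append(v)
    return dict(groups)


# Hypothetical neighbors of one node with their degrees.
toy_degree = {"a": 3, "b": 12, "c": 250, "d": 7}
neighbors = list(toy_degree)

by_md5 = partition_neighbors(neighbors, lambda v: group_by_md5(v, 2))
by_degree = partition_neighbors(
    neighbors, lambda v: group_by_value(toy_degree[v], [10, 100]))
print(by_md5)
print(by_degree)
```

Grouping by reviews count would reuse `group_by_value` with a
reviews-count lookup in place of `toy_degree`.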

Finally, we studied the performance of our algorithms over two
“ill-formed” graphs, the clustered graph and the barbell graph. The
clustered graph combines three complete graphs of sizes 10, 30 and
50. We also varied the size of the barbell graph from 20 to 56
nodes, to observe the change of performance of the algorithms with
graph size. The results are shown in Figures 10 and 11. One can
see from the results that, even for these ill-formed graphs, our
proposed algorithms consistently outperform SRW and NB-SRW for
varying graph sizes, according to all three bias measures being
used.
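The two synthetic graphs can be constructed as in the following
sketch. The text does not say how the complete graphs are connected;
the code assumes that consecutive cliques are joined by a single
edge, a choice that reproduces the node, edge and triangle counts
listed in Table 1 for the clustered graph (sizes 10, 30, 50) and for
a barbell graph made of two cliques of 50 nodes. Function names are
ours.

```python
from itertools import combinations


def chain_of_cliques(sizes):
    """Build disjoint complete graphs of the given sizes and join each
    consecutive pair with one bridging edge (an assumption)."""
    edges, cliques, start = set(), [], 0
    for size in sizes:
        nodes = list(range(start, start + size))
        cliques.append(nodes)
        edges |= set(combinations(nodes, 2))   # all pairs inside the clique
        start += size
    for left, right in zip(cliques, cliques[1:]):
        edges.add((left[-1], right[0]))        # single bridge between cliques
    return start, edges


def count_triangles(num_nodes, edges):
    """Count triangles through common neighbors of every edge."""
    adj = {v: set() for v in range(num_nodes)}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    # Each triangle is seen once per each of its three edges.
    return sum(len(adj[u] & adj[v]) for u, v in edges) // 3


clustered_n, clustered_edges = chain_of_cliques([10, 30, 50])
barbell_n, barbell_edges = chain_of_cliques([50, 50])

for n, edges in [(clustered_n, clustered_edges), (barbell_n, barbell_edges)]:
    avg_degree = 2 * len(edges) / n
    print(n, len(edges), round(avg_degree, 2), count_triangles(n, edges))
```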


Figure 7: Public benchmark datasets. Panels: (a) Facebook
KL-divergence; (b) Facebook ℓ2-distance; (c) Facebook estimation
error; (d) Youtube estimation error.

Figure 8: Distribution of SRW, CNRW and GNRW. Panels: (a) Facebook
dataset 1; (b) Facebook dataset 2; (c) Facebook dataset 1 (zoomed);
(d) Facebook dataset 2 (zoomed).


7. RELATED WORK

Sampling from online social networks. Online social networks
differ from other data mining resources because of their limited
access interface; thus, many papers such as [12[ |TO| targeted
the challenge of how to efficiently sample from large graphs.
With some knowledge of the global topology (e.g., the ID range of
all the nodes in the graph), o summarized sampling techniques
including random node sampling, random edge sampling and random
subgraph sampling in large graphs. Q combines random jump
and MHRW together to get efficient uniform samples. 03 also
demonstrated frontier sampling, which leverages the advantage
of having uniform random initial nodes. Without global topology,
03 compared sampling techniques such as Simple Random Walk,
Metropolis-Hastings Random Walk and traditional Breadth First
Search (BFS) and Depth First Search (DFS). Q confirmed that
MHRW is less efficient than SRW because MHRW mixes slower.
(H) introduced non-backtracking random walk. Also, Q considered
many parallel random walks at the same time.

Our work extends random walks to higher-order MCMCs and
systematically considers the historical information by introducing
path blocks, which is fundamentally different from existing
techniques.

Theoretical analyses of the random walk path blocks. According
to (T^ , the stratification of a random walk’s path blocks can
affect the asymptotic variance of its estimation. (TO also applied
(l6|’s theorem to show that non-backtracking random walk is
always better than SRW. Our work is based on the construction of
the path blocks, and we further discuss how to design the random
walk so that it forms a better stratification of the path blocks.

8. CONCLUSIONS 

In this paper, we considered a novel problem of leveraging
historic transitions in the design of a higher-ordered MCMC random
walk technique, in order to enable more efficient sampling of online
social networks that feature restrictive access interfaces.
Specifically, we developed two algorithms: (1) CNRW, which replaces
the memoryless transition in simple random walk with a
memory-based, sampling-without-replacement, transition design, and
(2) GNRW, which further considers the observed attribute values of
neighboring nodes in the transition design. We proved that while
CNRW and GNRW achieve the exact same target (sampling) distribution
as traditional simple random walks, they offer provably better (or
equal) efficiency no matter what the underlying graph topology is.
We also demonstrated the superiority of CNRW and GNRW over baseline
and state-of-the-art sampling techniques through experimental
studies on multiple real-world online social networks as well as
synthetic graphs.

Figure 10: Synthetic datasets: clustered graph. Panels: (a)
KL-divergence; (b) ℓ2-distance; (c) Estimation error.

Figure 11: Barbell graph size analytics. Panels: (a) KL-divergence;
(b) ℓ2-distance; (c) Estimation error.